
Frontiers in Bioinformatics

Frontiers Media SA

All preprints, ranked by how well they match Frontiers in Bioinformatics's content profile, based on 45 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Assessing Immune Microenvironment in TCGA-LUAD via CIBERSORTx Using Single-Cell Derived Signature Matrix and ESTIMATE Algorithm

Verma, M.

2024-05-10 bioinformatics 10.1101/2024.05.08.592760 medRxiv
Top 0.1%
14.4%

Lung cancer (LC) remains a significant global health concern, affecting millions worldwide each year. Tumor-infiltrating immune cells (TIICs) play a crucial role in lung cancer progression and prognosis, with various immune cell types infiltrating the tumor microenvironment. Traditional methods like immunohistochemistry and flow cytometry have limitations in accurately profiling TIIC subtypes. However, recent advancements in single-cell RNA sequencing and computational algorithms like CIBERSORTx offer a promising approach for characterizing TIICs in bulk tumor samples. In this study, we undertook the validation of the signature matrix comprising 14 distinct immune cell types and subtypes, which was originally derived from PBMC single-cell RNA-seq data in our previous work (Verma, 2024). The positive controls included 8 bulk RNA-seq samples of whole blood and specific immune cell bulk RNA-seq samples, while the negative control comprised neuroblastoma cell lines lacking immune content. Subsequently, we applied this signature matrix to deconvolute TCGA-LUAD data (n = 598), and assessed tumor purity and immune-stromal content using the ESTIMATE algorithm. Our findings indicate that the signature matrix accurately reflected flow cytometry-derived fractions, supported by correlation analysis. Specifically, the second positive control and negative control accurately reflected immune and non-immune sample fractions, respectively, further validating the efficacy of our approach. This study also provides insights into the invasion of immunocytes in lung adenocarcinoma and highlights the potential of computational tools like CIBERSORTx and ESTIMATE in characterizing the immune microenvironment of LC.

2
Addressing persistent challenges in digital image analysis of cancerous tissues

Prabhakaran, S.; Yapp, C.; Baker, G. J.; Beyer, J.; Chang, Y. H.; Creason, A. L.; Krueger, R.; Muhlich, J.; Patterson, N. H.; Sidak, K.; Sudar, D.; Taylor, A. J.; Ternes, L.; Troidl, J.; Yubin, X.; Sokolov, A.; Tyson, D. R.; Participants of the Cell Imaging Hackathon 2022

2023-07-24 bioinformatics 10.1101/2023.07.21.548450 medRxiv
Top 0.1%
9.2%

The National Cancer Institute (NCI) supports many research programs and consortia, many of which use imaging as a major modality for characterizing cancerous tissue. A trans-consortia Image Analysis Working Group (IAWG) was established in 2019 with a mission to disseminate imaging-related work and foster collaborations. In 2022, the IAWG held a virtual hackathon focused on addressing challenges of analyzing high dimensional datasets from fixed cancerous tissues. Standard image processing techniques have automated feature extraction, but the next generation of imaging data requires more advanced methods to fully utilize the available information. In this perspective, we discuss current limitations of the automated analysis of multiplexed tissue images, the first steps toward deeper understanding of these limitations, what possible solutions have been developed, any new or refined approaches that were developed during the Image Analysis Hackathon 2022, and where further effort is required. The outstanding problems addressed in the hackathon fell into three main themes: 1) challenges to cell type classification and assessment, 2) translation and visual representation of spatial aspects of high dimensional data, and 3) scaling digital image analyses to large (multi-TB) datasets. We describe the rationale for each specific challenge and the progress made toward addressing it during the hackathon. We also suggest areas that would benefit from more focus and offer insight into broader challenges that the community will need to address as new technologies are developed and integrated into the broad range of image-based modalities and analytical resources already in use within the cancer research community.

3
Can a Sparse 29 x 29 Pixel Chaos Game Representation Predict Protein Binding Sites using Fine-Tuned State-of-the Art Deep Learning Semantic Segmentation Models?

Dick, K.; Green, J. R.

2023-08-04 bioinformatics 10.1101/2023.08.04.410498 medRxiv
Top 0.1%
9.1%

No. While our experiments ultimately failed, this work was motivated by the seemingly reasonable hypothesis that encoding protein sequences as a fractal-based image in combination with a binary mask identifying those pixels representative of the protein binding interface could effectively be used to fine-tune a semantic segmentation model. We were wrong. Despite the shortcomings of this work, a number of insights were drawn, inspiring discussion about how this fractal-based space may be exploited to generate effective protein binding site predictors in the future. Furthermore, these realizations promise to orient complementary studies leveraging fractal-based representations, whether in the field of bioinformatics, or more broadly within disparate fields leveraging sequence-type data, such as Natural Language Processing. In a non-traditional way, this work presents the experimental design undertaken and interleaves various insights and limitations. It is the hope of this work that those interested in leveraging fractal-based representations and deep learning architectures as part of their work will benefit from the insights arising from this work.

4
Immune Infiltration and Survival Analysis in Lung Adenocarcinoma: Identifying Prognostic Cell Types

Verma, M.

2024-05-22 bioinformatics 10.1101/2024.05.11.593151 medRxiv
Top 0.1%
8.2%

The tumor microenvironment (TME) plays a crucial role in the development and survival of neoplastic cells, with tumor-infiltrating leukocytes (TILs) constituting a significant component. This immune infiltrate exhibits a diverse composition of adaptive and innate immunological cell subtypes, with varying prognostic implications across different cancer types. Recent advancements in immunotherapy underscore the importance of evaluating TILs as potential biological identifiers, particularly in the context of novel treatment strategies. In lung adenocarcinoma, the most prevalent histological subtype of lung cancer, multiple immune cell types have been identified within the TME, influencing tumor classification, clinical outcomes, and patient survival. While prior research has demonstrated a correlation between tumor-infiltrating immune cells and the progression of lung adenocarcinoma, few studies have examined their prognostic implications comprehensively. In our previous work, we constructed a signature matrix (Verma, 2024b), evaluated the fractions of 14 immune cell types in TCGA-LUAD data, and performed ESTIMATE analysis to assess immune infiltration, stromal infiltration, and tumor purity (Verma, 2024a). Building on that work, in this study we investigate the association between immune cell infiltration patterns and the overall survival and prognosis of TCGA-LUAD patients across different histological subtypes and stages. Our findings aim to elucidate the immune cell types positively or negatively impacting patient outcomes in lung adenocarcinoma and inform future therapeutic approaches.

5
A Novel Machine Learning Systematic Framework and Web Tool for Breast Cancer Survival Rate Assessment

Ji, J. M.; Shen, W. H.

2022-09-17 oncology 10.1101/2022.09.16.22280052 medRxiv
Top 0.1%
6.8%

Cancer research, including that of breast cancer, has increasingly relied on molecular profiling based on advances in genomic technology. Although these techniques have permitted scientists to unravel the process by which cancer develops, scientists still struggle to effectively translate the vast amounts of patient data into clinically meaningful results. As a result, tasks such as predicting the human response to differing treatments remain a major challenge in cancer treatment. There have been many studies attempting to determine the survival indicators of breast cancer patients. However, most of these analyses were predominantly performed using traditional statistical methods, which are imperfect and inadequate in tackling vast amounts of data or unstructured data on human breast cancer. With the exponential progress in computing power and artificial intelligence approaches, we believe that there is an opportunity for machine learning to supersede our current capabilities in fully understanding the correlations between geneset alterations, drug responses, and the prognosis of breast cancer patients. This information would greatly benefit scientists and physicians in developing clinical therapeutic strategies, such as performing personalized treatment. This machine learning project employs multiple machine learning approaches, including a novel deep learning algorithm, in building models for the detection and visualization of significant prognostic indicators of breast cancer patient survival rate. The clinical and genomic data of 1,980 primary breast cancer samples used in this project were obtained from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database of cBioPortal. The data were preprocessed and then split to train eight classical machine learning models and the aforementioned deep learning Convolutional Neural Network (CNN) model. These models were evaluated using the recall scores, the accuracy scores, the receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC) on the training dataset and confirmed using the rest of the data of the dataset. Both the deep learning and machine learning methods produced desirable prediction accuracies. However, the deep learning model noticeably outperformed all other classifiers and achieved the highest accuracy (AUC = 0.900). This project was constructed in the Google Colab environment based on Python and its programming libraries with data visualization, TensorFlow, and Keras. The CNN model demonstrates a powerful ability to be used as a systematic framework for real-time prediction by end users. A web application for breast cancer survival rate prediction was designed and developed using Streamlit, TensorFlow, Keras, and Python libraries to allow end users to interact with the model with ease and obtain quick and accurate predictions.

6
Pilot study demonstrating changes in DNA hydroxymethylation enable detection of multiple cancers in plasma cell-free DNA

Bergamaschi, A.; Ning, Y.; Ku, C.-J.; Ellison, C.; Collin, F.; Guler, G.; Phillips, T.; McCarthy, E.; Wang, W.; Antoine, M.; Scott, A.; Lloyd, P.; Ashworth, A.; Quake, S.; Levy, S.

2020-01-27 genetic and genomic medicine 10.1101/2020.01.22.20018382 medRxiv
Top 0.1%
6.4%

Our study employed the detection of 5-hydroxymethyl cytosine (5hmC) profiles on cell-free DNA (cfDNA) from the plasma of cancer patients using a novel enrichment technology coupled with sequencing and a machine learning-based classification method. These classification methods were developed to detect the presence of disease in the plasma of cancer and control subjects. Cancer and control patient cfDNA cohorts were accrued from multiple sites consisting of 48 breast, 55 lung, 32 prostate and 53 pancreatic cancer subjects. In addition, a control cohort of 180 subjects (non-cancer) was employed to match cancer patient demographics (age, sex and smoking status) in a case-control study design. Logistic regression methods applied to each cancer case cohort individually, with a balancing non-cancer cohort, were able to classify cancer and control samples with measurably high performance. Measures of predictive performance by using 5-fold cross validation coupled with out-of-fold area under the curve (AUC) measures were established for breast, lung, pancreatic and prostate cancer to be 0.89, 0.84, 0.95 and 0.83 respectively. The genes defining each of these predictive models were enriched for pathways relevant to disease specific etiology, notably in the control of gene regulation in these same pathways. The breast cancer cohort consisted primarily of stage I and II patients, including tumors < 2 cm, and these samples exhibited a high cancer probability score. This suggests that the 5hmC derived classification methodology may yield epigenomic detection of early stage disease in plasma. The same observation was made for the pancreatic dataset, where >50% of cancers were stage I and II and showed the highest cancer probability score.
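
The abstract above reports out-of-fold AUC from 5-fold cross-validation. As an illustration only, the following minimal Python sketch shows one way such a measure can be computed. The helper names (`auc`, `kfold_indices`, `out_of_fold_scores`) and the `fit`/`predict` callables are hypothetical and are not taken from the preprint; the folds here are contiguous and unshuffled, whereas a real analysis would typically shuffle and stratify.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def kfold_indices(n, k=5):
    """Split range(n) into k contiguous folds of near-equal size (no shuffling)."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_scores(X, y, fit, predict, k=5):
    """Score every sample with a model that never saw it during training.

    `fit(X_train, y_train)` returns a model; `predict(model, x)` returns a score.
    Both are placeholders for whatever classifier is being evaluated.
    """
    oof = [None] * len(y)
    for test_idx in kfold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            oof[i] = predict(model, X[i])
    return oof
```

The out-of-fold scores are pooled across folds and passed to `auc` once, which is one common convention; averaging per-fold AUCs is another, and the abstract does not say which was used.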

7
Direct pathway enrichment prediction from histopathological whole slide images and comparison with gene expression mediated models

Jabin, A.; Ahmad, S.

2026-03-04 bioinformatics 10.64898/2026.03.02.709137 medRxiv
Top 0.1%
6.4%

Molecular profiling of tumours via RNA sequencing (RNA-seq) enables clinically actionable stratification but remains costly, tissue-intensive, and time-consuming. Recent advances in computational pathology suggest that routine H&E whole-slide images (WSIs) can be utilized to estimate transcriptomic states of cancer cells. Because WSI-derived predictions of transcriptional signatures are noisy, using them for accurate biological interpretation is challenging. On the other hand, pathway enrichment analysis has been routinely used in describing biologically meaningful cellular states from noisy gene expression data, and some studies have evaluated the ability of WSI-predicted gene expression profiles to reconstruct enriched pathways in experiments where the two data modalities were concurrently available. However, it remains unclear whether a predictive model designed to predict enriched pathways directly from WSI samples would be better than the current approaches, which first predict gene expression. Here, we develop and evaluate these two complementary approaches for predicting pathway enrichment profiles from WSIs in TCGA Breast Invasive Carcinoma (TCGA-BRCA) by training parallel models: those which predict pathway enrichment directly from image features and those which rely on predicted gene expression profiles, which is the current state of the art. Our results suggest that under controlled experiments direct prediction of a selected pool of enriched pathways outperforms the models trained on predicting gene expression and then inferring enrichments on predicted gene expression values. These findings will be helpful in prioritizing the goals of predictive modeling of WSIs and improving diagnostic outcomes of cancer patients.

8
Prediction models with survival data: a comparison between machine learning and the Cox proportional hazards model

Hazewinkel, A.-D.; Gelderblom, H.; Fiocco, M.

2022-04-02 oncology 10.1101/2022.03.29.22273112 medRxiv
Top 0.1%
6.4%

Recent years have seen increased interest in using machine learning (ML) methods for survival prediction, chiefly using big datasets with mixed datatypes and/or many predictors. Model comparisons have frequently been limited to performance measure evaluation, with the chosen measure often suboptimal for assessing survival predictive performance. We investigated ML model performance in an application to osteosarcoma data from the EURAMOS-1 clinical trial (NCT00134030). We compared the performance of survival neural networks (SNN), random survival forests (RSF) and the Cox proportional hazards model. Three performance measures suitable for assessing survival model predictive performance were considered: the C-index, and the time-dependent Brier and Kullback-Leibler scores. Comparisons were also made on predictor importance and patient-specific survival predictions. Additionally, the effect of ML model hyper-parameters on performance was investigated. All three models had comparable performance as assessed by the C-index and Brier and Kullback-Leibler scores, with the Cox model and SNN also comparable in terms of relative predictor importance and patient-specific survival predictions. RSFs showed a tendency for according less importance to predictors with uneven class distributions and predicting clustered survival curves, the latter a result of tuning hyperparameters that influence forest shape through restrictions on terminal node size and tree depth. SNNs were comparatively more sensitive to hyperparameter misspecification, with decreased regularization resulting in inconsistent predicted survival probabilities. We caution against using RSF for predicting patient-specific survival, as standard model tuning practices may result in aggregated predictions, which is not reflected in performance measure values, and recommend performing multiple reruns of SNNs to verify prediction consistency.
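
The C-index mentioned in the abstract above measures how often a model ranks patients correctly by risk. The sketch below is a plain-Python illustration of Harrell's concordance index under right censoring; it is not the preprint's code, the function name `concordance_index` is chosen here for illustration, and pairs with tied event times are simply skipped, which is one of several conventions.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs whose risk ordering matches outcome.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : model output, higher meaning earlier expected event
    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's observed time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Because the index depends only on ordering, two models can share a C-index while producing very different patient-specific survival curves, which is consistent with the caution raised in the abstract.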

9
Variation in bulk RNA-seq and estimated cell type proportion using deconvolution when comparing pancreatic cancer samples within the same individual

Jansen, R. J.; Munro, S. A.; Antwi, S. O.; Rabe, K. G.; Sicotte, H.

2025-05-06 genetic and genomic medicine 10.1101/2025.05.05.25326976 medRxiv
Top 0.1%
6.3%

Introduction: There is great promise in using genomic data to inform individual cancer treatment plans. Assessing intratumor genetic heterogeneity, studies have shown it may be possible to target biopsies to tumor subclones driving disease progression or treatment resistance. Here, we explore if the interpretation of tumor gene expression analysis varies across two specimens from the same patient. Methods: We performed bulk RNA-seq using FFPE samples from 16 patients who also had a previous separate bulk RNA-seq performed and deposited in TCGA. We used three different deconvolution methods to compare cell type proportions for these paired data. We normalized study-specific gene expression values per gene by calculating transcripts per million and adjusted for batch effect across study to compare median expression values. We also compared the reliability of gene expression measurements. We selected KRAS, TP53, SMAD4, and CDKN2A, as the most mutated genes in pancreatic cancer, and CTNNB1, JUN, SMAD3, SMAD7, and TCF7, as these tend to be enriched in pancreatic cancer compared with adjacent normal tissue. Results: We found that average cell type proportion varied the most between studies (i.e., samples for each patient) for NK and macrophages (using adjusted p-value 0.05/21=0.002). For the differential expression analysis, we did not observe significant differences in average expression of any of the selected genes. We observed substantial concordance (kappa = 0.75) only for JUN, with low to moderate concordance (i.e., kappa values of 0.25-0.5) when using a median cut point for the remaining 8 genes across the two studies. Discussion: Together, the findings suggest that more than one tumor sample may be needed for effective treatment planning. Any potential difference in observed expression values across the paired samples could be related to the different cell type proportions across the samples. The sample size was small, and each study used different sequencing technologies, so any interpretation should be confirmed with additional studies.
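
The abstract above summarizes agreement between paired samples with kappa values computed after a median cut point, and uses a Bonferroni-style threshold of 0.05/21. The following is a minimal sketch of those two calculations, assuming Cohen's kappa on binary high/low calls; the function names are hypothetical, and whether values equal to the median count as "high" is an assumption made here, not something the abstract specifies.

```python
from statistics import median


def dichotomize(values):
    """Label each value 1 (above the median) or 0 (at or below it)."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


def cohens_kappa(a, b):
    """Cohen's kappa for two binary label lists of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)  # agreement expected by chance
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests
```

For example, `cohens_kappa(dichotomize(study_a), dichotomize(study_b))` would give the agreement for one gene across the two studies, where `study_a` and `study_b` are placeholder lists of per-patient expression values in matching order. `bonferroni_threshold(0.05, 21)` evaluates to roughly 0.0024, which the abstract reports rounded to 0.002.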

10
Persistent Homology in Medical Image Processing: A Literature Review

Brito-Pacheco, D. A.; Reyes-Aldasoro, C. C.; Giannopoulos, P.

2025-02-25 health informatics 10.1101/2025.02.21.25322669 medRxiv
Top 0.1%
6.3%

Medical image analysis has experienced significant advances with the integration of machine learning, deep learning, and other mathematical and computational methodologies into the pipelines of data analysis. One methodology that has received less attention is Persistent Homology (PH), which comes from the growing field of Topological Data Analysis and has the ability to extract features from data at different scales and build multi-scale summaries. In this work, we present a systematic review of PH applied in medical images. To illustrate the potential of PH, we introduce the main concepts of PH and demonstrate with an example of histopathology. Fifteen articles where PH was applied to medical image analysis tasks such as segmentation and classification were selected and reviewed. It was observed that PH is very versatile, as it can be applied in many different contexts and to different data types, whilst also showing great potential in increasing model accuracy in both classification and segmentation. It was also observed that image segmentation predominantly uses basic level-set filtration to calculate PH, while classification takes various approaches using filtration on more complex structures built from data. This review highlights PH as an important tool that can further advance medical image analysis.

11
More Structures, Less Accuracy: ESM3's Binding Prediction Paradox

Loux, T.; Wang, D.; Shakhnovich, E.

2024-12-09 molecular biology 10.1101/2024.12.09.627585 medRxiv
Top 0.1%
6.3%

This paper investigates the impact of incorporating structural information into the protein-protein interaction predictions made by ESM3, a multimodal protein language model (pLM). We utilized various structural variants as inputs and compared three widely used structure acquisition pipelines (EvoEF2, GROMACS, and Rosetta Relax) to assess their effects on ESM3's performance. Our findings reveal that the use of a consistent identical structure, regardless of whether it is relaxed or variant, consistently enhances model performance across various datasets. This improvement is striking in few-shot learning. However, performance deteriorates when different relaxed mutant structures are used for each variant. Based on these results, we advise caution when integrating distinct mutant structures into ESM3 and similar models. This study highlights the critical need for careful consideration of structural inputs in protein binding affinity prediction.

12
Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Unsal, S.; Atas, H.; Albayrak, M.; Turhan, K.; Acar, A. C.; Dogan, T.

2020-10-28 bioinformatics 10.1101/2020.10.28.359828 medRxiv
Top 0.1%
4.9%

Data-centric approaches have been utilized to develop predictive methods for elucidating uncharacterized aspects of proteins such as their functions, biophysical properties, subcellular locations and interactions. However, studies indicate that the performance of these methods should be further improved to effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for samples in a dataset, to be later used in quantitative modelling tasks. Data representation learning methods do this by training and using a model that employs statistical and machine/deep learning algorithms. These novel methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in the field of natural language processing. Lately, these learned data representations have been applied to the field of protein informatics and have displayed highly promising results in terms of extracting complex traits of proteins regarding sequence-structure-function relations. In this study, we conducted a detailed investigation of protein representation learning methods, by first categorizing and explaining each approach, and then conducting benchmark analyses on: (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examine the advantages and disadvantages of each representation approach over the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers in applying machine/deep learning-based representation techniques on protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods/tools to be utilized in the fields of biomedicine and biotechnology.

13
JRSeek: Artificial Intelligence Meets Jelly Roll Fold Classification in Viruses

Sanchez, J. E.; Guo, W.; Li, C.; Li, L. E.; Xiao, C.

2025-01-29 bioinformatics 10.1101/2025.01.27.635132 medRxiv
Top 0.1%
4.9%

The jelly roll (JR) fold is the most common structural motif found in the capsid and nucleocapsid of viruses. Its pervasiveness across many different viral families motivates developing a tool to predict its presence from a sequence. In the current work, logistic regression (LR) models trained on six different large language model (LLM) embeddings exhibited over 95% accuracy in differentiating JR from non-JR sequences. The dataset used for training and testing included sequences from single JR viruses, non-JR viruses, and non-virus immunoglobulin-like β-sandwich (IGLBS) proteins which closely resemble the JR fold in structure. The high accuracy is particularly remarkable given the low sequence similarity across viral families and the balanced nature of the dataset. Also, the accuracy of the models was independent of LLM embeddings, suggesting that peak accuracy for predicting viral JR folds hinges more on the data quality and quantity than on the specific mathematical models used. Given that many viral capsid and nucleocapsid structures have yet to be resolved, using sequence-based LLMs is a promising strategy that can readily be applied to available data. Principal Component Analysis of the Bert-U100 embeddings demonstrates that most IGLBS sequences and a subset of JR and non-JR sequences are distinguishable even before the application of the LR model, but the LR model is necessary to differentiate a subset of more ambiguous sequences. When applied to double JR folds, the Bert-U100 model was able to assign the JR motif for some viral families, providing evidence for the model's generalizability. However, for other families, this generalizability was not observed, motivating a future need to develop other models informed by double JR folds. Lastly, the Bert-U100 model was also able to predict whether sequences from a dataset of unclassified viruses produce the JR fold. Two examples are given and the JR predictions are corroborated by AlphaFold3. Altogether, this work demonstrates that JR folds can, in principle, be predicted from their sequences.

14
An Improved Dataset for Predicting Mammal Infecting Viruses from Genetic Sequence Information

Reddy, T.; Schneider, A.; Hall, A. R.; Witmer, A.; Hengartner, N.

2026-01-25 bioinformatics 10.1101/2025.09.17.676952 medRxiv
Top 0.1%
4.9%

There have been several attempts to develop machine learning (ML) models to identify human infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic, because these models are typically trained and evaluated on different datasets with alternative data splitting schemes, features, and model performance metrics. In this paper we present a standardized dataset of mammal infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and new host target labels, primate and mammal. The new host labels were included for several reasons, including previous reports that classification performance is better at broader taxonomic ranks and the idea that there may be more data for primate infection that might serve as a suitable proxy for zoonotic potential and avoidance of false positives for human infection due to absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training/testing sets, when compared to the original assignments into training/testing in Mollentze et al., increases the overall average ROC AUC of prediction of human infection from 0.663 ± 0.070 to 0.784 ± 0.013, consistent with the reduction in phylogenetic distance between train and test sets (relative entropy change from 3.00 to 0.08). The broadest host category of mammal infection can be predicted most reliably at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods to predict human host infections. Overall, we have presented preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks, that, unsurprisingly, reducing the phylogenetic distance between training and test sets can improve predictive performance, and that peptide kmer features appear to be harmful to out of sample model performance. We are left with the question of whether models for virus host prediction can reasonably be expected to perform well in out of sample scenarios given the likelihood that viruses do not share a common ancestor. Consistent with this concern, when the data is resampled such that there is no overlap between viral families in training and test sets (relative entropy > 24), models perform no better than random chance at prediction of human infection regardless of whether kmers are included (ROC AUC 0.50 ± 0.08) or not (ROC AUC 0.50 ± 0.04).

Author Summary: Determining whether a virus can infect a human or other animal based on its genetic information is useful for assessing the threat level of circulating and newly emerging viruses. Previous studies in this domain have had access to limited datasets, and in this work we nearly double the amount of manually labelled host data for viral infection, so that others may build on it and improve it further. We use machine learning models to rank the likelihood of human and mammal infection for viruses in this improved dataset. Results are consistent with the determination of host infection being more tractable for broader categories of hosts, like mammals, than for specific species, like humans. This may suggest that the prospects are good for improved future models that first screen viruses based on their likelihood of infecting mammals, and then in a second stage for likelihood of human infection. The most challenging scenarios were for predictions of viruses that were not similar to viruses in the training data, and the question remains whether we can expect reasonable generalization of predictive models to completely new viruses given that, at the time of writing, viruses do not appear to share a common ancestor.
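
The abstract above uses relative entropy to quantify how far apart the training and test sets are. The exact formulation is not given here, so the sketch below is an assumption: it computes the Kullback-Leibler divergence between the viral-family frequency distributions of the two sets, with a small smoothing constant so that families absent from one set give a large finite value. The function name, the direction of the divergence, the smoothing constant, and the log base are all illustrative choices, so outputs should not be expected to match the reported numbers.

```python
import math
from collections import Counter


def relative_entropy(p_labels, q_labels, eps=1e-9):
    """Smoothed KL divergence D(P || Q) in bits between two label samples.

    p_labels, q_labels : category labels (e.g. viral family names, assumed)
    eps                : pseudocount added to every category in both samples
    """
    if not p_labels or not q_labels:
        raise ValueError("both label lists must be non-empty")
    categories = sorted(set(p_labels) | set(q_labels))
    p_counts, q_counts = Counter(p_labels), Counter(q_labels)
    p_total = len(p_labels) + eps * len(categories)
    q_total = len(q_labels) + eps * len(categories)
    total = 0.0
    for cat in categories:
        p = (p_counts[cat] + eps) / p_total
        q = (q_counts[cat] + eps) / q_total
        total += p * math.log2(p / q)
    return total
```

Under these assumptions, identical family compositions give a value of zero, and splits with no shared families give a large value whose size is set mainly by `eps`. This mirrors the qualitative pattern in the abstract, where a random split has low relative entropy and a family-disjoint split has a high one.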

15
MyGESig: a population-specific gene signature improves survival prediction in Malaysian breast cancer patients

Khairi, M. H. F. B.; Wong, Z. L.; Ang, B. H.; Phipps-Tan, J.; Nur Fatin, P.; Pathmanathan, R.; Hoong, S. M.; Mohd Taib, N. A.; Yip, C.-H.; Ho, W. K.; Tai, M. C.; Teo, S.-H.; Cheong, S. C.; Jia-Wern, P.

2025-09-02 genetic and genomic medicine 10.1101/2025.08.28.25334111 medRxiv
Top 0.1%
4.9%

Accurate prognostic models are essential for guiding treatment decisions and improving patient outcomes in breast cancer. To achieve this, population-specific models are needed to account for genetic, clinical, and pathological differences across populations. In this study, the widely used and freely available PREDICT v3.0 breast cancer prognostic model was first validated in the multiethnic Malaysian Breast Cancer (MyBrCa) cohort to assess its performance. Given its only moderate performance in this population, a machine learning workflow was developed to integrate gene expression and clinical information for classifying patients by their 10-year prognosis. A 77-gene signature, termed MyGESig, was derived from the transcriptomes of 258 MyBrCa patients. Using this signature in combination with clinical variables, an ensemble-based model achieved a median area under the receiver operating characteristic curve (AUROC) of 0.92 in the hold-out testing set and 0.90 in the independent MyBrCa dataset. While the model exhibited poor generalizability in external cohorts, its discriminative performance improved when trained and tested within the same population (median AUROC: 0.71 in METABRIC; 0.84 in SCAN-B), validating the prognostic value of the gene set. Together, these findings demonstrate the value of incorporating population-specific gene expression datasets into prognosis prediction and highlight the need to develop and validate models tailored to diverse populations in breast cancer.

16
Ensembles for improved detection of invasive breast cancer in histological images

Solorzano, L.; Robertson, S.; Hartman, J.; Rantalainen, M.

2023-04-14 bioinformatics 10.1101/2023.04.13.536542 medRxiv
Top 0.1%
4.9%

Accurate detection of invasive breast cancer (IC) can provide decision support to pathologists as well as improve downstream computational analyses, where detection of IC is a first step. Tissue containing IC is characterized by the presence of specific morphological features, which can be learned by convolutional neural networks (CNNs). Here, we compare the use of a single CNN model versus an ensemble of several base models with the same CNN architecture, and we evaluate prediction performance as well as variability across ensemble-based model predictions. Two in-house datasets comprising 587 whole-slide images (WSIs) are used to train an ensemble of ten InceptionV3 models whose consensus is used to determine the presence of IC. A novel visualization strategy was developed to communicate ensemble agreement spatially. Performance was evaluated in an internal test set with 118 WSIs, and in an additional external dataset (TCGA breast cancer) with 157 WSIs. We observed that the ensemble-based strategy outperformed the single CNN-model alternative with respect to accuracy on tile level in 89% of all WSIs in the test set. The overall accuracy was 0.92 (DICE coefficient, 0.90) for the ensemble model, and 0.85 (DICE coefficient, 0.83) for the single CNN alternative in the internal test set. For TCGA, the ensemble outperformed the single CNN in 96.8% of the WSIs, with an accuracy of 0.87 (DICE coefficient, 0.89), whereas the single model achieved an accuracy of 0.75 (DICE coefficient, 0.78). The results suggest that an ensemble-based modeling strategy for invasive breast cancer detection consistently outperforms the conventional single-model alternative. Furthermore, visualization of the ensemble agreement and confusion areas provides direct visual interpretation of the results. High-performing cancer detection can provide decision support in the routine pathology setting as well as facilitate downstream computational analyses.
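
The abstract above does not say how the ten models' outputs are combined into a consensus, so the following is only a minimal sketch under an assumed rule: average the per-tile probabilities across models and threshold the mean. The function names (ensemble_consensus, agreement, dice) and the 0.5 threshold are hypothetical and are not taken from the preprint. The sketch also shows a per-tile agreement score of the kind a spatial agreement map could display, and the Dice coefficient used as an evaluation metric.

```python
from statistics import mean


def ensemble_consensus(tile_probs, threshold=0.5):
    """Combine per-tile probabilities from several models into 0/1 labels.

    tile_probs: one list of probabilities per model, all of equal length.
    Assumed rule (not from the preprint): mean probability >= threshold.
    """
    n_tiles = len(tile_probs[0])
    return [
        1 if mean(model[i] for model in tile_probs) >= threshold else 0
        for i in range(n_tiles)
    ]


def agreement(tile_probs, threshold=0.5):
    """Fraction of models siding with the majority vote for each tile."""
    n_models = len(tile_probs)
    scores = []
    for i in range(len(tile_probs[0])):
        positive = sum(1 for model in tile_probs if model[i] >= threshold)
        scores.append(max(positive, n_models - positive) / n_models)
    return scores


def dice(pred, truth):
    """Dice coefficient between two binary label lists."""
    true_pos = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    denom = sum(pred) + sum(truth)
    # Two empty masks are treated as a perfect match.
    return 2 * true_pos / denom if denom else 1.0


# Toy example: three models, four tiles.
probs = [
    [0.9, 0.2, 0.6, 0.1],
    [0.8, 0.4, 0.4, 0.2],
    [0.7, 0.1, 0.7, 0.6],
]
labels = ensemble_consensus(probs)
print(labels, agreement(probs), dice(labels, [1, 0, 0, 0]))
```

Tiles with agreement below 1.0 are the ones where the base models disagree, which is the kind of region an agreement visualization would highlight.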

17
Machine Learning Approach to Integrate and Analyse Multiomics data to Identify Actionable Biomarkers for Head and Neck Squamous Cell Carcinoma (HNSCC)

Panchal, K.; Arockia Rajesh Packiam, K.; MAJUMDAR, S.

2025-10-13 genetic and genomic medicine 10.1101/2025.10.09.25335922 medRxiv
Top 0.1%
4.9%

Head and neck squamous cell carcinoma (HNSCC) ranks sixth among the most common cancers worldwide and is a major cause of death. A molecular understanding of disease progression can aid in timely diagnosis and therapy. This study aims to identify potential HNSCC biomarkers using a machine learning-based approach to integrate and analyse multi-omics data, namely publicly available datasets from Human Papillomavirus (HPV)-negative patients in the CPTAC-HNSCC project, including transcriptomics, methylomics, proteomics, and phosphoproteomics. A three-step feature selection method was used to identify potential molecular biomarkers with machine learning algorithms. The top 1000 features (genes) were filtered using Mutual Information, followed by a random forest-based feature importance ranking and Recursive Feature Elimination with cross-validation coupled with a Support Vector Machine (SVM-RFECV), to obtain a minimal gene set for the machine learning-based tumor-normal classification task. To benchmark these top-selected features, Logistic Regression (LogR), Random Forest (RF), Multi-Layer Perceptron (MLP), and Support Vector Machine (SVC) classifiers were used. The prediction performance of classifiers trained on these selected gene sets was evaluated using the accuracy metric and then compared against that of models trained on randomly selected gene sets. The entire workflow was repeated 100 times with different random states to establish statistical confidence in the pipeline and the selected gene set. Our integrative approach identified both omics-specific and cross-omics candidate genes with very high classification accuracy, ranging from ~95% to 100%. These genes reveal convergent biological processes central to HNSCC pathogenesis, which reinforces the robustness of the methodology. The methodology can be adopted to analyse similar multi-omics datasets for other pathologies and foundational biological questions.

18
Enhancing t-SNE Performance for Biological Sequencing Data through Kernel Selection

Chourasia, P.; Murad, T.; Ali, S.; Patterson, M.

2023-08-22 bioinformatics 10.1101/2023.08.21.554138 medRxiv
Top 0.1%
4.9%

The genetic code for many different proteins can be found in biological sequencing data, which offers vital insight into the genetic evolution of viruses. While machine learning approaches are becoming increasingly popular for many "Big Data" situations, they have made little progress in comprehending the nature of such data. One such approach is t-distributed Stochastic Neighbour Embedding (t-SNE), a general-purpose method used to represent high-dimensional data in low-dimensional (LD) space while preserving similarity between data points. Traditionally, the Gaussian kernel is used with t-SNE. However, since the Gaussian kernel is not data-dependent, it determines each local bandwidth based on one local point only. This makes it computationally expensive and hence limited in scalability. Moreover, it can misrepresent some structures in the data. An alternative is to use the isolation kernel, which is a data-dependent method. However, it has a single parameter to tune in computing the kernel. Although the isolation kernel yields better performance in terms of scalability and preserving the similarity in LD space, it may still not perform optimally in some cases. This paper presents a perspective on improving the performance of t-SNE and argues that kernel selection could impact this performance. We use 9 different kernels to evaluate their impact on the performance of t-SNE, using SARS-CoV-2 "spike" protein sequences. With three different embedding methods, we show that the cosine similarity kernel gives the best results and enhances the performance of t-SNE.
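
To make the idea of a cosine similarity kernel concrete, here is a small sketch that builds such a kernel matrix for protein sequences. The abstract does not name its three embedding methods, so the k-mer count embedding used here is purely an illustrative assumption, as are the function names (kmer_counts, cosine_similarity, cosine_kernel) and the toy sequences.

```python
import math
from collections import Counter


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (assumed embedding, for illustration)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    # Define similarity with an empty vector as 0 to avoid dividing by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cosine_kernel(seqs, k=3):
    """Pairwise cosine similarity matrix over all sequences."""
    vectors = [kmer_counts(s, k) for s in seqs]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy sequences, not real spike proteins.
seqs = ["MFVFLVLLPL", "MFVFLVLLPL", "QQQQQQQQQQ"]
for row in cosine_kernel(seqs):
    print([round(x, 3) for x in row])
```

A similarity matrix like this can be turned into distances (for example 1 minus similarity) and passed to a t-SNE implementation that accepts precomputed distances; whether the preprint proceeds this way is not stated in the abstract.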

19
Large protein databases reveal structural complementarity and functional locality

Szczerbiak, P.; Szydlowski, L. M.; Wydmanski, W.; Renfrew, P. D.; Koehler Leman, J.; Kosciolek, T.

2024-10-16 bioinformatics 10.1101/2024.08.14.607935 medRxiv
Top 0.1%
4.8%

Recent breakthroughs in protein structure prediction have led to an unprecedented surge in high-quality 3D models, highlighting the need for efficient computational solutions to manage and analyze this wealth of structural data. In our work, we comprehensively examine the structural clusters obtained from the AlphaFold Protein Structure Database (AFDB), a high-quality subset of ESMAtlas, and the Microbiome Immunity Project (MIP). We create a single cohesive low-dimensional representation of the resulting protein space. Our results show that, while each database occupies distinct regions within the protein structure space, they collectively exhibit significant overlap in their functional profiles. High-level biological functions tend to cluster in particular regions, revealing a shared functional landscape despite the diverse sources of data. By creating a single, cohesive low-dimensional representation of protein structure space integrating data from diverse sources, localizing functional annotations within this space, and providing an open-access web-server for exploration, this work offers insights for future research concerning protein sequence-structure-function relationships, enabling various biological questions to be asked about taxonomic assignments, environmental factors, or functional specificity. This approach is generalizable to other or future datasets, enabling further discovery beyond findings presented here.

20
The consequences of variant calling decisions in secondary analyses of cancer sequencing data

Garcia-Prieto, C.; Valencia, A.; Porta-Pardo, E.

2020-01-30 bioinformatics 10.1101/2020.01.29.924860 medRxiv
Top 0.1%
4.8%

The analysis of cancer genomes provides fundamental information about their aetiology, the processes driving cell transformation, and potential treatments. The first crucial step in the analysis of any tumor genome is the identification of somatic genetic variants that cancer cells have acquired during their evolution. For that purpose, a wide range of somatic variant callers have been developed in recent years. While there have been some efforts to benchmark somatic variant calling tools and strategies, the extent to which variant calling decisions impact the results of downstream analyses of tumor genomes remains unknown. Here we present a study to elucidate whether different variant callers (MuSE, MuTect2, SomaticSniper, VarScan2) and strategies to combine them (Consensus and Union) lead to different results in three important downstream analyses of cancer genomics data: identification of cancer driver genes, quantification of mutational signatures, and detection of clinically actionable variants. To this end, we tested how the results of these three analyses varied depending on the somatic mutation caller in five different projects from The Cancer Genome Atlas (TCGA). Our results show that variant calling decisions have a significant impact on these downstream analyses, creating important differences in driver gene identification and mutational process attribution among variant call sets, as well as in the detection of clinically actionable targets. More importantly, it seems that Consensus, a strategy very widely used by the research community, is not the optimal strategy, as it can lead to the loss of some cancer driver genes and actionable mutations. On the other hand, the Union seems to be a legitimate strategy for some downstream analyses, with robust performance overall. Contact: eduard.porta@bsc.es; alfonso.valencia@bsc.es
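
The difference between the Union and Consensus strategies can be sketched with plain sets. The abstract does not state how many callers must agree for a variant to enter the Consensus set, so the min_callers parameter below is an assumption, as are the function name combine_calls and the made-up variant tuples.

```python
from collections import Counter


def combine_calls(call_sets, min_callers):
    """Return variants reported by at least min_callers callers.

    call_sets: dict mapping caller name -> iterable of variants, where each
    variant is a hashable tuple such as (chrom, pos, ref, alt).
    min_callers=1 gives the union; a higher value gives a consensus set.
    """
    counts = Counter(
        variant for calls in call_sets.values() for variant in set(calls)
    )
    return {variant for variant, n in counts.items() if n >= min_callers}


# Made-up calls for illustration only.
calls = {
    "MuSE": [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")],
    "MuTect2": [("chr1", 100, "A", "T"), ("chr3", 300, "C", "A")],
    "SomaticSniper": [("chr1", 100, "A", "T")],
    "VarScan2": [("chr2", 200, "G", "C")],
}

union = combine_calls(calls, min_callers=1)
consensus = combine_calls(calls, min_callers=2)  # assumed threshold
print(sorted(union - consensus))  # variants kept by Union but lost by Consensus
```

Variants seen by a single caller, such as the chr3 call in this toy input, survive only in the union, which illustrates how a consensus filter can drop calls that matter downstream.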